A cluster centers initialization method for clustering categorical data

Authors

Liang Bai

Jiye Liang

Chuangyin Dang

Fuyuan Cao

Abstract

Keywords: The k-modes algorithm; Initialization method; Initial cluster centers; Density; Distance

The leading partitional clustering technique, k-modes, is one of the most computationally efficient clustering methods for categorical data. However, the k-modes algorithm converges to one of numerous local minima, so its performance depends strongly on the initial cluster centers. Most existing cluster center initialization methods are designed for numerical data. Because categorical data lack a geometry, these methods are not applicable to categorical data. This paper proposes a novel initialization method for categorical data and applies it to the k-modes algorithm. The method integrates distance and density to select initial cluster centers, and it overcomes shortcomings of the existing initialization methods for categorical data. Experimental results show that the proposed initialization method is effective and, because its time complexity is linear in the number of data objects, can be applied to large data sets.

Clustering is the process of grouping a set of objects into clusters so that objects in the same cluster are highly similar to each other but very dissimilar to objects in other clusters. Various types of clustering methods have been proposed and developed; see, for instance, (Jain & Dubes, 1988). Clustering algorithms in the literature can generally be classified into two types: hierarchical clustering and partitional clustering. Hierarchical clustering algorithms, which are essentially heuristic procedures, produce a hierarchy of partitions of the set of observations according to either an agglomerative or a divisive strategy.
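The abstract says the method combines density and distance but this excerpt does not give the formulas, so the following is only a minimal sketch of that general idea and not the authors' procedure. It assumes a density defined as the average relative frequency of an object's attribute values, the simple matching dissimilarity as the distance, and a greedy rule that picks the densest object first and then repeatedly picks the object maximizing density times distance to the nearest chosen center. The names `matching_distance`, `density`, and `init_centers` are hypothetical.

```python
from collections import Counter


def matching_distance(a, b):
    """Simple matching dissimilarity: count of attributes where a and b differ."""
    return sum(x != y for x, y in zip(a, b))


def density(obj, freq, n):
    """Assumed density: average relative frequency of the object's attribute values."""
    return sum(freq[j][v] for j, v in enumerate(obj)) / (len(obj) * n)


def init_centers(data, k):
    """Greedy density-and-distance seeding (illustrative assumption, not the paper's rule)."""
    n, m = len(data), len(data[0])
    # Per-attribute value counts, computed in one pass over the data.
    freq = [Counter(row[j] for row in data) for j in range(m)]
    dens = [density(row, freq, n) for row in data]

    # First center: the densest object (ties go to the earliest object).
    first = max(range(n), key=lambda i: dens[i])
    chosen = [first]
    # Distance from every object to its nearest chosen center so far.
    min_dist = [matching_distance(row, data[first]) for row in data]

    while len(chosen) < k:
        # Next center: dense and far from all centers chosen so far.
        nxt = max(range(n), key=lambda i: dens[i] * min_dist[i])
        chosen.append(nxt)
        for i, row in enumerate(data):
            min_dist[i] = min(min_dist[i], matching_distance(row, data[nxt]))

    return [data[i] for i in chosen]


data = [("a", "x"), ("a", "x"), ("a", "y"), ("b", "z"), ("b", "z")]
centers = init_centers(data, 2)
```

Each new center costs one pass over the data, so this sketch runs in time proportional to the number of objects for fixed `k` and a fixed number of attributes, which is in the spirit of the linear complexity the abstract reports.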
Partitional clustering algorithms, in general, assume a given number of clusters and seek to optimize an objective function that measures the homogeneity within clusters and/or the separation between clusters. The k-means algorithm is a well-known partitional clustering algorithm that is widely used in real-world applications such as marketing research and data mining to cluster very large data sets because of its efficiency. The k-modes algorithm extends the k-means algorithm: it removes the numeric-only limitation of k-means and enables the k-means clustering process to be used to efficiently cluster large categorical data sets from real-world databases. Since it was first published, the k-modes algorithm has become a popular technique for solving categorical data clustering problems in different application domains (Andreopoulos, An, & Wang, 2005). The k-means algorithm and the k-modes algorithm use alternating minimization methods to …
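As a rough illustration of the alternating minimization mentioned above, the sketch below implements a basic k-modes loop under standard assumptions: objects are compared with the simple matching dissimilarity, and each cluster center is updated to the attribute-wise mode of its members. The function names and the `max_iter` parameter are hypothetical choices for this example.

```python
from collections import Counter


def matching_distance(a, b):
    """Simple matching dissimilarity: count of attributes where a and b differ."""
    return sum(x != y for x, y in zip(a, b))


def k_modes(data, centers, max_iter=100):
    """Alternate between assigning objects and updating modes until labels stop changing."""
    centers = [tuple(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        # Step 1: with centers fixed, assign each object to its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda c: matching_distance(row, centers[c]))
            for row in data
        ]
        if new_labels == labels:
            break  # assignments are stable, so a local minimum has been reached
        labels = new_labels

        # Step 2: with assignments fixed, set each center to the attribute-wise mode.
        for c in range(len(centers)):
            members = [row for row, lab in zip(data, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = tuple(
                    Counter(col).most_common(1)[0][0] for col in zip(*members)
                )
    return centers, labels


data = [("a", "x"), ("a", "x"), ("a", "y"), ("b", "z"), ("b", "z")]
modes, labels = k_modes(data, [("a", "x"), ("b", "z")])
```

Because neither step can increase the total within-cluster dissimilarity, the loop stops at a local minimum that depends on the starting centers, which is why the choice of initial centers matters.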

Similar resources

Cluster center initialization algorithm for K-modes clustering

Partitional clustering of categorical data is normally performed by using K-modes clustering algorithm, which works well for large datasets. Even though the design and implementation of K-modes algorithm is simple and efficient, it has the pitfall of randomly choosing the initial cluster centers for invoking every new execution that may lead to non-repeatable clustering results. This paper addr...

Clustering Categorical Data Using Community Detection Techniques

With the advent of the k-modes algorithm, the toolbox for clustering categorical data has an efficient tool that scales linearly in the number of data items. However, random initialization of cluster centers in k-modes makes it hard to reach a good clustering without resorting to many trials. Recently proposed methods for better initialization are deterministic and reduce the clustering cost co...

Initialization of K-modes clustering using outlier detection techniques

The K-modes clustering has received much attention, since it works well for categorical data sets. However, the performance of K-modes clustering is especially sensitive to the selection of initial cluster centers. Therefore, choosing the proper initial cluster centers is a key step for K-modes clustering. In this paper, we consider the initialization of K-modes clustering from the view of outl...

An initialization method to simultaneously find initial cluster centers and the number of clusters for clustering categorical data

Article history: Received 19 April 2010 Received in revised form 21 February 2011 Accepted 24 February 2011 Available online 2 March 2011

Cluster Center Initialization for Categorical Data Using Multiple Attribute Clustering

The K-modes clustering algorithm is well known for its efficiency in clustering large categorical datasets. The K-modes algorithm requires random selection of initial cluster centers (modes) as seed, which leads to the problem that the clustering results are often dependent on the choice of initial cluster centers and non-repeatable cluster structures may be obtained. In this paper, we propose ...

Journal title:

Expert Syst. Appl.

Volume 39

Publication date: 2012
